Problem: Predicting Credit Card Fraud

Introduction to business scenario

You work for a multinational bank. There has been a significant increase in the number of customers experiencing credit card fraud over the last few months. A major news outlet even recently published a story about the credit card fraud you and other banks are experiencing.

As a response to this situation, you have been tasked to solve part of this problem by leveraging machine learning to identify fraudulent credit card transactions before they have a larger impact on your company. You have been given access to a dataset of past credit card transactions, which you can use to train a machine learning model to predict if transactions are fraudulent or not.

About this dataset

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred over the course of two days and includes examples of both fraudulent and legitimate transactions.

Features

The dataset contains over 30 numerical features, most of which have undergone principal component analysis (PCA) transformations because of personal privacy issues with the data. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount. 'Class' is the response or target variable, and it takes a value of '1' in cases of fraud and '0' otherwise.

Features: V1, V2, ... V28: Principal components obtained with PCA

Non-PCA features: 'Time', 'Amount', and the target variable 'Class'

Dataset attributions

Website: https://www.openml.org/d/1597

Twitter: https://twitter.com/dalpozz/status/645542397569593344

Authors: Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi

Source: Credit card fraud detection - June 25, 2015

Official citation: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group (mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.

Step 1: Problem formulation and data collection

Start this project off by writing a few sentences below that summarize the business problem and the business goal you're trying to achieve in this scenario. Include a business metric you would like your team to aspire toward. With that information defined, clearly write out the machine learning problem statement. Finally, add a comment or two about the type of machine learning this represents.

Read through a business scenario and:

1. Determine if and why ML is an appropriate solution to deploy.

ML is appropriate here because we have a labeled historical dataset, and it is difficult for a human to decide whether a transaction is fraudulent when the number of features and transactions is so large.

2. Formulate the business problem, success metrics, and desired ML output.

Business problem: identify fraudulent credit card transactions before they have a larger impact on the company.
Success metric: a decrease in the number of customers experiencing credit card fraud.
Desired ML output: a prediction of whether a given transaction is fraudulent or not.

3. Identify the type of ML problem you’re dealing with.

Binary classification

4. Analyze the appropriateness of the data you’re working with.

The data is appropriate for ML: it is labeled, fully numerical, and large enough to train on.

Setup

Now that we have decided where to focus our energy, let's set things up so you can start working on solving the problem.

Note: This notebook was created and tested on an ml.m4.xlarge notebook instance.

Downloading the dataset

Step 2: Data preprocessing and visualization

In this data preprocessing phase, you should take the opportunity to explore and visualize your data to better understand it. First, import the necessary libraries and read the data into a Pandas dataframe. After that, explore your data. Look for the shape of the dataset and explore your columns and the types of columns you're working with (numerical, categorical). Consider performing basic statistics on the features to get a sense of feature means and ranges. Take a close look at your target column and determine its distribution.

Specific questions to consider

  1. What can you deduce from the basic statistics you ran on the features?

  2. What can you deduce from the distributions of the target classes?

  3. Is there anything else you deduced from exploring the data?

Read the CSV data into a Pandas dataframe.

Check the dataframe by printing the first 5 rows of the dataset.

Check the dataframe length.

Check for duplicates.

Information about the dataframe.

Column types and null values.

Convert Class to integer.
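A minimal sketch of these loading and cleaning steps. It runs here on a tiny toy frame so it is self-contained; in the notebook you would instead read the real CSV (the filename `'creditcard.csv'` is an assumption):

```python
import pandas as pd

# Toy stand-in for the real data; in the notebook you would use
# df = pd.read_csv('creditcard.csv')   # 'creditcard.csv' is an assumed filename
df = pd.DataFrame({
    'Time':   [0.0, 1.0, 1.0, 2.0],
    'V1':     [0.1, -1.2, -1.2, 0.5],
    'Amount': [10.0, 99.5, 99.5, 3.2],
    'Class':  ['0', '1', '1', '0'],   # Class may load as strings
})

print(df.head())                 # check the first rows
print(len(df))                   # dataframe length
print(df.duplicated().sum())     # number of exact duplicate rows
df = df.drop_duplicates()        # keep one copy of each row
df.info()                        # column dtypes and null counts
df['Class'] = df['Class'].astype(int)   # convert Class to integer
```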

Task: Validate all the columns in the dataset and see that they are what you read above: V1-V28, Time, Amount, and Class.

Basic statistics

It seems that Amount is strongly right-skewed: most transactions are small, with a long tail of large amounts.

Now look at the target variable, Class.

First, we can find out what the distribution is for it.

Fraud = 473
Legitimate = 283,253
The data is highly imbalanced.
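One way to check that distribution, shown here on toy labels (on the real, de-duplicated data the same call yields 473 fraud vs. 283,253 legitimate):

```python
import pandas as pd

# Toy labels standing in for the real 'Class' column
df = pd.DataFrame({'Class': [0, 0, 0, 1, 0, 1]})

counts = df['Class'].value_counts()
print(counts)                      # count per class
fraud_ratio = counts[1] / len(df)
print(f'{fraud_ratio:.2%} of transactions are fraudulent')
```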

Assumption

We know this dataset contains transactions made in September 2013.
We assume the first transaction happened on Monday, September 2, 2013, 12:00:00 AM UTC.
The corresponding epoch timestamp is 1378080000.

Convert Time to timestamp so we can convert it later to Date.
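A sketch of that conversion under the epoch assumption above; the `Datetime` column name is illustrative:

```python
import pandas as pd

# Assumed anchor: Monday, 2013-09-02 00:00:00 UTC -> epoch 1378080000
EPOCH_START = 1378080000

# 'Time' holds seconds elapsed since the first transaction
df = pd.DataFrame({'Time': [0.0, 3600.0, 90000.0]})
df['Datetime'] = pd.to_datetime(df['Time'] + EPOCH_START, unit='s')
print(df)
```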

Visualize the data

Specific questions to consider

  1. After looking at the distributions of features like Amount and Time, to what extent might those features help your model? Is there anything you can deduce from those distributions that might be helpful in better understanding your data?

  2. Do the distributions of features like Amount and Time differ when you are looking only at data that is labeled as fraud?

  3. Are there any features in your dataset that are strongly correlated? If so, what would be your next steps?

Resample the data by hour

The distribution of Time differs when we look only at data labeled as fraud.
The number of legitimate transactions is low from 12 AM to 6 AM and then increases, while fraudulent transactions look random, with no clear pattern over time.
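The hourly view can be sketched as follows, assuming a `Datetime` column derived from the epoch anchor above (the timestamps here are toy values):

```python
import pandas as pd

# Toy transactions; 'Datetime' is assumed to come from the Time conversion
df = pd.DataFrame({
    'Datetime': pd.to_datetime(
        [1378080000, 1378080060, 1378083600, 1378170000], unit='s'),
    'Amount': [10.0, 5.0, 20.0, 7.0],
})

# Transactions per calendar hour: set the datetime index and resample
hourly = df.set_index('Datetime').resample('h').size()
print(hourly.head())

# Neglecting the day: group by hour of day instead
by_hour = df.groupby(df['Datetime'].dt.hour).size()
print(by_hour)
```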

We neglect the day and consider only the hour.

As we see, the number of transactions starts increasing at about 4:00 AM and keeps rising until 10:00 PM, when it begins to decrease again; in the fraud case, as mentioned earlier, the distribution looks random, with no clear pattern.

Fraud happens both day and night, with only a small difference between the two, so it is not easy to separate fraud from non-fraud on this feature alone.

The distribution of Amount looks much the same when we look only at data labeled as fraud.

Now let's look at a distribution using a Seaborn function called pairplot. pairplot creates a grid of scatterplots, such that each feature in the dataset is used once as the X-axis and once as the Y-axis. The diagonal of this grid shows a distribution of the data for that feature.

Look at V1, V2, V4, and Class pairplots. What do you see in the plots? Can you differentiate the fraud and not fraud from these features?

Hint: Create a new dataframe with columns V1, V2, V4, and Class.

For this small subset of features there is some visible separation between fraud and not fraud, but no single feature separates the two classes cleanly.

How features interact with each other.

Check correlation between features generated by PCA

As expected, there is no correlation between V1 and V2, because they were generated by PCA, and principal components are uncorrelated by construction.

There is no strong correlation here.
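A sketch of the correlation check on synthetic, independent columns standing in for the PCA features (the same `corr` matrix is what `sns.heatmap(corr)` would visualize):

```python
import numpy as np
import pandas as pd

# Independent normals mimic PCA outputs: correlations should be near zero
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=['V1', 'V2', 'V3', 'V4'])

corr = df.corr()
print(corr.round(2))

# Flag any strongly correlated pair, ignoring the diagonal (always 1.0)
strong = (corr.abs() > 0.85) & (corr.abs() < 1.0)
print(strong.any().any())   # no strong correlation expected here
```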

Step 3: Model training and evaluation

There are some preliminary steps that you have to include when converting the dataset from a DataFrame to a format that a machine learning algorithm can use. For Amazon SageMaker, here are the steps you need to take:

  1. Split the data into train_data, validation_data, and test_data using sklearn.model_selection.train_test_split.
  2. Convert the dataset to an appropriate file format that the Amazon SageMaker training job can use. This can be either a CSV file or record protobuf. For more information, see Common Data Formats for Training.
  3. Upload the data to your Amazon S3 bucket.
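Step 1 might look like the following, shown on toy arrays. The 80/10/10 ratio and the stratification are assumptions, and the upload call in the trailing comment only works inside an AWS session:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features/labels standing in for the real dataframe columns
X = np.arange(100 * 3, dtype=float).reshape(100, 3)
y = (np.arange(100) % 10 == 0).astype(int)   # imbalanced labels

# Two chained splits: 80% train, 10% validation, 10% test.
# stratify keeps the fraud ratio similar across the splits.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

print(X_train.shape, X_val.shape, X_test.shape)

# In the notebook you would then write each split as CSV (label in the first
# column, no header or index, per the SageMaker CSV format) and upload it, e.g.
# sagemaker.Session().upload_data(path='train.csv', bucket=bucket, key_prefix='data')
```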

Note: If more than one role is required for notebook instances, training, and/or hosting, replace the get_execution_role() call with the appropriate full IAM role ARN string(s).

Replace <LabBucketName> with the resource name that you have.

Model training

Let's start by instantiating the LinearLearner estimator with the predictor_type='binary_classifier' parameter and one ml.m4.xlarge instance.
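A sketch of that instantiation, assuming SageMaker Python SDK v2 parameter names. The import is deferred into a function because the SDK and AWS credentials are only available inside the notebook environment:

```python
def make_estimator(role, bucket):
    """Sketch: build a LinearLearner binary classifier (needs AWS).

    role and bucket are placeholders; in the notebook, role would come
    from sagemaker.get_execution_role().
    """
    import sagemaker   # deferred: only importable in an AWS environment

    return sagemaker.LinearLearner(
        role=role,
        instance_count=1,
        instance_type='ml.m4.xlarge',
        predictor_type='binary_classifier',
        output_path=f's3://{bucket}/output')

# In the notebook:
# estimator = make_estimator(sagemaker.get_execution_role(), bucket)
```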

Linear learner accepts training data in protobuf or CSV content types, and accepts inference requests in protobuf, CSV, or JSON content types.
Training data has features and ground-truth labels, while the data in an inference request has only features.
In a production pipeline, it is recommended to convert the data to the Amazon SageMaker protobuf format and store it in Amazon S3.

However, to get up and running quickly, AWS provides the convenient method record_set for converting and uploading when the dataset is small enough to fit in local memory. It accepts NumPy arrays like the ones you already have, so let's use it here.

The RecordSet object will keep track of the temporary Amazon S3 location of your data.
Use the estimator.record_set function to create train, validation, and test records.
Then, use the estimator.fit function to start your training job.
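Those steps might be wrapped like this (a sketch that only runs inside AWS; the function and argument names are illustrative):

```python
def train_linear_learner(estimator, train_X, train_y, val_X, val_y):
    """Sketch: wrap NumPy arrays as RecordSets and launch the training job.

    record_set converts each array and uploads it to a temporary S3
    location; labels are expected as a float32 vector.
    """
    train_records = estimator.record_set(train_X, labels=train_y, channel='train')
    val_records = estimator.record_set(val_X, labels=val_y, channel='validation')

    # fit accepts a list of RecordSets, one per channel
    estimator.fit([train_records, val_records])
    return estimator
```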

Model evaluation

In this section, you'll evaluate your trained model. First, use the estimator.deploy function with initial_instance_count= 1 and instance_type= 'ml.m4.xlarge' to deploy your model on Amazon SageMaker.

Now that you have a hosted endpoint running, you can make real-time predictions from the model easily by making an http POST request.
But first, you'll need to set up serializers and deserializers for passing your test_features NumPy arrays to the model behind the endpoint.
You will also calculate the confusion matrix for your model to evaluate how it has done on your test data visually.
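A sketch of those two pieces. The deploy/predict part needs a live AWS session, while the confusion matrix is plain scikit-learn, shown here on toy labels:

```python
from sklearn.metrics import confusion_matrix

def deploy_and_predict(estimator, test_features):
    """Sketch: host the model and score test features (needs AWS)."""
    predictor = estimator.deploy(
        initial_instance_count=1, instance_type='ml.m4.xlarge')
    # LinearLearner predictors return Record protobufs; pull out the label
    results = predictor.predict(test_features)
    return [r.label['predicted_label'].float32_tensor.values[0]
            for r in results]

# The confusion matrix step, on toy predictions
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(confusion_matrix(y_true, y_pred))   # rows: actual, cols: predicted
```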

Similar to the test set, you can also look at the metrics for the training set. Keep in mind that those are also shown to you above in the logs.

Key questions to consider:

  1. How does your model's performance on the test set compare to the training set? What can you deduce from this comparison?

    The results look about the same on the test and training sets; from this we can deduce that the model is not overfitting but is likely underfitting and still needs more learning/training.

  2. Are there obvious differences between the outcomes of metrics like accuracy, precision, and recall? If so, why might you be seeing those differences?

    Yes, the precision is larger than the others. Because the data is imbalanced, accuracy can also be misleading: with 1,000 data points (100 fraud and 900 not fraud), a model that predicts everything as not fraud still achieves 90% accuracy in the worst case, while precision and recall are 0.

  3. Given your business situation and goals, which metric(s) is most important for you to consider here? Why?

    Recall, because the cost of predicting a transaction as legitimate when it is actually fraudulent is higher than the cost of flagging a legitimate transaction as fraud.

  4. Is the outcome for the metric(s) you consider most important sufficient for what you need from a business standpoint? If not, what are some things you might change in your next iteration (in the feature engineering section, which is coming up next)?

    No. We can try to resample the dataset, remove correlated features if any exist, and rescale the features.

Step 4: Feature engineering

You've now gone through one iteration of training and evaluating your model. Given that the outcome you reached for your model the first time probably wasn't sufficient for solving your business problem, what are some things you could change about your data to possibly improve model performance?

Key questions to consider:

  1. How might the balance of your two main classes (fraud and not fraud) impact model performance?
  2. Does balancing your dataset have any impact on correlations between your features?
  3. Are there feature reduction techniques you could perform at this stage that might have a positive impact on model performance?
  4. After performing some feature engineering, how does your model performance compare to the first iteration?

The accuracy is calculated with how many examples the model got right. However, most of the examples are actually negative, so if you actually predict all examples as zero in this very imbalanced dataset, you can still get an accuracy of about 99.827%. Having an imbalanced dataset may cause some problems with algorithm performance. So it's useful to treat the imbalance in the data before you train the model.

Question: How do you solve the problem of dataset imbalance?

Use sns.countplot to plot the original distribution of the dataset.

Convert train_features back into a DataFrame.

There are two main ways to handle imbalanced datasets: undersampling the majority class and oversampling the minority class.

You can use a library called Imbalanced-learn for sampling the datasets. imbalanced-learn is a Python package offering a number of resampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of scikit-learn-contrib projects. For more information, see imbalanced-learn API documentation.
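As one concrete (assumed) usage of the library, oversampling with SMOTE might look like this; the import is deferred into the function because imbalanced-learn may not be installed everywhere:

```python
def oversample_with_smote(X, y, random_state=235):
    """Sketch: balance classes with imbalanced-learn's SMOTE oversampler.

    SMOTE synthesizes new minority-class points by interpolating between
    existing ones; the function name here is illustrative.
    """
    from imblearn.over_sampling import SMOTE   # deferred optional import

    sampler = SMOTE(random_state=random_state)
    return sampler.fit_resample(X, y)
```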

Choose undersampling for this example first. To create the balanced dataset:

  1. Create a new DataFrame fraud_df with all the positive examples.
  2. Create another DataFrame non_fraud_df and use dataframe.sample with the same number as the fraud_df DataFrame and random_state=235.
  3. Concatenate both DataFrames into a new DataFrame balanced_df.
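The three steps above can be sketched as follows, on a toy imbalanced frame standing in for the real data:

```python
import numpy as np
import pandas as pd

# Toy imbalanced frame: 10 fraud rows, 90 legitimate rows
rng = np.random.default_rng(0)
df = pd.DataFrame({'V1': rng.normal(size=100),
                   'Class': [1] * 10 + [0] * 90})

# 1. All positive (fraud) examples
fraud_df = df[df['Class'] == 1]
# 2. Sample the same number of negatives, with a fixed seed
non_fraud_df = df[df['Class'] == 0].sample(n=len(fraud_df), random_state=235)
# 3. Concatenate into the balanced frame
balanced_df = pd.concat([fraud_df, non_fraud_df])

print(balanced_df['Class'].value_counts())   # equal counts per class
```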

Check the distribution and shape again using sns.countplot().

Before looking at the training, look at what will happen if you use a feature reduction technique like t-Distributed Stochastic Neighbor Embedding (t-SNE) on the dataset.

Question: Does t-SNE help you differentiate the fraud from not fraud?

Answer: t-SNE can differentiate between some of the fraud and not fraud cases, but it doesn't do a good job of completely differentiating the two when we reduce the data to just two dimensions.

Now that you have the new data, compare what the correlation matrix looks like before and after.

Question: What can you deduce from looking at the different correlation matrices? If you see a difference, can you analyze why there is a difference?

Answer: In the unbalanced set, the correlation matrix didn't show any linear relationships, but in this smaller balanced dataset, you can see some linear relationships.

Question: Would you drop any columns because of the correlated data?

Answer: Yes, remove any features that have more than 85% correlation.

Because there are some correlations, let's remove the correlated data that has more than 0.89 correlation.
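A sketch of that pruning on a toy frame where one column nearly duplicates another; taking the upper triangle of the correlation matrix inspects each pair exactly once:

```python
import numpy as np
import pandas as pd

# Toy frame: 'B' is almost a copy of 'A', so corr(A, B) is near 1
rng = np.random.default_rng(2)
a = rng.normal(size=200)
df = pd.DataFrame({'A': a,
                   'B': a + rng.normal(scale=0.01, size=200),
                   'C': rng.normal(size=200)})

# Upper triangle of |corr| (diagonal and lower triangle masked out)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above the 0.89 threshold with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.89).any()]
print(to_drop)
df_reduced = df.drop(columns=to_drop)
print(df_reduced.columns.tolist())
```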

Now it's time to train, deploy, and evaluate using the new balanced dataset.

Reducing the number of examples made the recall go up. Let's try a different strategy, because we need an even higher recall.

Optional: Convert the new dataset to a Pandas DataFrame and check the shape and distribution of the data.

Check for correlation between features after oversampling

As we see here, there are more correlations after oversampling the data than after undersampling it.
So we will again remove features correlated above 0.89.

Create new train, test, and validation datasets.

Train your model using the new dataset.

Increasing the number of examples made the recall go up to 0.929.

Try StandardScaler

StandardScaler doesn't affect the recall (still 0.929).

Try MinMaxScaler

MinMaxScaler increases the recall to 0.930.

Hyperparameter optimization

Another part of the model tuning phase is to perform hyperparameter optimization. This section gives you an opportunity to tune your hyperparameters to see the extent to which tuning improves your model performance. Use the following template code to help you get started launching an Amazon SageMaker hyperparameter tuning job and viewing the evaluation metrics. Use the following questions to help guide you through the rest of the section.

Key questions to consider:

  1. How does the outcome of your objective metric of choice change as timing of your tuning job increases? What's the relationship between the different objective metrics you're getting and time?
  2. What is the correlation between your objective metric and the individual hyperparameters? Is there a hyperparameter that has a strong correlation with your objective metric? If so, what might you do to leverage this strong correlation?
  3. Analyze the performance of your model after hyperparameter tuning. Is current performance sufficient for what you need to solve your business problem?

Project presentation: Record key decisions and methods you use in this section in your project presentations, as well as any new performance metrics you obtain after evaluating your model again.

Track hyperparameter tuning job progress

After you launch a tuning job, you can see its progress by calling the describe_tuning_job API. The output from describe-tuning-job is a JSON object that contains information about the current state of the tuning job. You can call list_training_jobs_for_tuning_job to see a detailed list of the training jobs that the tuning job launched.
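A sketch of such a status check via boto3 (the underlying boto3 operation is describe_hyper_parameter_tuning_job; job_name is whatever name you gave your tuning job):

```python
def tuning_job_status(job_name, region=None):
    """Sketch: poll a SageMaker tuning job's state (needs AWS credentials).

    Returns the job status string, e.g. 'InProgress' or 'Completed'.
    """
    import boto3   # deferred: requires AWS credentials to be useful

    client = boto3.client('sagemaker', region_name=region)
    desc = client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=job_name)
    return desc['HyperParameterTuningJobStatus']
```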

Fetch all results as a DataFrame

You can list hyperparameters and objective metrics of all training jobs and pick up the training job with the best objective metric.

Deploy this as your final model and evaluate it on the test set.

Then, because you're training with the CSV file format, create s3_inputs that the training function can use as a pointer to the files in Amazon S3.
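A sketch of those pointers, assuming SDK v2's TrainingInput class (the older SDK called this helper s3_input); bucket and prefix are placeholder names:

```python
def make_csv_inputs(bucket, prefix):
    """Sketch: channel inputs for estimator.fit() with CSV data in S3.

    Each TrainingInput points at an S3 prefix and declares the CSV
    content type so the training job parses the files correctly.
    """
    from sagemaker.inputs import TrainingInput   # deferred AWS-only import

    return {
        'train': TrainingInput(
            f's3://{bucket}/{prefix}/train', content_type='text/csv'),
        'validation': TrainingInput(
            f's3://{bucket}/{prefix}/validation', content_type='text/csv'),
    }
```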